## Google

# Static Performance Analysis with LLVM

#### **Clément Courbet**

G. Chatelet, B. De Backer, O. Sykora,

Google Compiler Research







Low-precision matrix multiplication <a href="mailto:github.com/google/gemmlowp">github.com/google/gemmlowp</a>



Optimize this at (nearly) all costs!

#### Tools

#### Benchmarks

- Closer to real-life performance
- Slooooow
- Requires access to hardware

#### Static Analysis

- Fast
- Reproducible
- Hard to model input-dependent behaviour (branches, cache)

### Our Static Performance Analyzer



#### Simulator API

Target-independent <u>Simulator</u> interface:

```
const vector<MCInst>& BasicBlock = ...;
auto Simulator = Target->createSimulator();
SimulationLog Log = Simulator.Run(BasicBlock);
```

http://www.realworldtech.com/haswell-cpu/6/

### Simulator Internals: Components

Generic →
 Reuse LLVM's target-independent descriptions, e.g.

Scheduler: 11vm::MCSchedMode1

RegisterRenamer: 11vm::MCRegisterInfo

Target-specific, e.g. Intel <u>Fetcher</u>



analyzing 'libyuv\_sumsquareerrorsse2.s'

ran 20 iterations in 132 cycles

Block Inverse Throughput: [5-6] cycles per iteration

### Analysis: IACA-like frontend

#### Port Pressure (cycles per iteration):

| Port Pressure (Cycles per Iteration): |          |           |                  |              |         |         |           |                |                |         |                                                         |
|---------------------------------------|----------|-----------|------------------|--------------|---------|---------|-----------|----------------|----------------|---------|---------------------------------------------------------|
|                                       | Port     | HWDivider | HWPort0          | HWPort1      | HWPort2 | HWPort3 | HWPort4   | HWPort5        | HWPort6        | HWPort7 | -<br>                                                   |
|                                       | Cycles   |           | 4.35             | 4.40         | 1.00    | 1.00    |           | 4.30           | 1.95           |         |                                                         |
| _                                     |          |           |                  |              |         |         |           |                |                |         |                                                         |
|                                       | #Uops    | HWDivider | HWPort0          | HWPort1      | HWPort2 | HWPort3 | HWPort4   | HWPort5        | HWPort6        | HWPort7 |                                                         |
| -<br> <br>                            | 1  <br>1 | <br> <br> | <br>   <br>      | 0.30         | 1.00    | <br>    | <br> <br> | 0.70           |                |         | movdqu xmm1, xmmword ptr [rdi]<br>lea rdi, [rdi + 0x10] |
|                                       | 1  <br>1 |           | <br>             | 0.55         | <br>    | 1.00    |           | 0.45           | <br>           | <br>    | movdqu xmm2, xmmword ptr [rsi]<br>lea rsi, [rsi + 0x10] |
|                                       | 1  <br>1 |           | 0.35  <br>       | 0.65<br>0.70 | <br>    |         |           | 0.30           | <br>           |         | movdqa xmm3, xmm1<br>psubusbxmm1, xmm2                  |
| į                                     | 1  <br>1 |           | <br>  0.95       | 0.40<br>0.05 | İ       | į       | İ         | 0.60           | İ              | į       | psubusb xmm2, xmm3<br>por xmm1, xmm2                    |
| ļ                                     | 1        |           | 1.00             | 0.03         |         |         |           | 4 00           |                | ļ       | movdqa xmm2, xmm1                                       |
|                                       | 1  <br>1 |           | <br>             |              |         |         |           | 1.00  <br>1.00 |                |         | punpcklbw xmm1, xmm5<br>punpckhbw xmm2, xmm5            |
|                                       | 1  <br>1 |           | 1.00  <br>  1.00 |              | <br>    |         |           |                |                |         | <pre>pmaddwd xmm1, xmm1 pmaddwd xmm2, xmm2</pre>        |
| į                                     | 1  <br>1 |           | <br>             | 0.75<br>1.00 | <br>    | į       | į         | 0.25           | j<br>          | į       | paddd xmm0, xmm1<br>paddd xmm0, xmm2                    |
| ļ                                     | 1        |           | 0.05             |              |         |         |           |                | 0.95  <br>1.00 | į       | sub ecx, 0x10<br>jg .Ltmp0                              |
| - 1                                   | ±        |           | ı                |              | ı       |         |           |                | 1.00           | 1       | JB • F CIIIPO                                           |

### **Automatic Scheduling**

Minimize the simulated latency (alternative to PostRAMachineSched)

- Random/Exhaustive Search (<u>CP Solver</u>) ~hours-days
- Genetic Algorithms (Biased Random-Key Genetic Algorithms) < seconds</li>



#### **Exhaustive Search**

gemmlowp's SSE4\_32\_Kernel4x4Depth2:
 0-2% faster (vs. implementation contributed by Intel).

libwebp's FTransform():
 0-5% faster.

On benchmarks; no performance regressions

### Genetic Algorithms

100 milliseconds

10% improvement (in theory)

```
Original: 9.5 Cycles/Iter
                                       Rescheduled: 8.5 Cycles/Iter
     movd mm0, dword ptr [r8]
                                            movd mm0, dword ptr [r8]
     punpcklbw mm0, mm7
                                            punpcklbw mm0, mm7
     movq mm1, mm0
                                            movq mm1, mm0
                                            pmullw
     movq mm2, mm0
                                                       mm1, mm1
     pmullw
                mm1, mm1
                                            movq mm2, mm0
     paddw mm1, mm3
                                            pmullw
                                                       mm2, mm6
                                            paddw mm1, mm3
     psrlw mm1, 0x8
     pmullw
                                            psrlw mm1, 0x8
                mm0, mm1
     paddw mm0, mm3
                                            pmullw
                                                       mm0, mm1
     psrlw mm0, 0x8
                                            pmullw
                                                       mm1, mm5
     pmullw
                mm0, mm4
                                            paddw mm0, mm3
     pmullw
                mm1, mm5
                                            psrlw mm0, 0x8
     pmullw
                mm2, mm6
                                            pmullw
                                                       mm0, mm4
     paddw mm0, mm1
                                            paddw mm0, mm1
     paddw mm0, mm2
                                            paddw mm0, mm2
     psrlw mm0, 0x6
                                            psrlw mm0, 0x6
     packuswb mm0, mm7
                                            packuswb mm0, mm7
     movd dword ptr [r8], mm0
                                            movd dword ptr [r8], mm0
     add r8, 0x4
                                            add
                                                 r8, 0x4
     dec
          ecx
                                            dec
                                                 ecx
     ine
           .Ltmp0
                                            ine
                                                  .Ltmp0
```

### **Future Work**

- Integrating into <u>llvm-mca</u> (in particular frontend simulation)
- Genetic scheduler → MachineFunctionPass

### Try It Out!

https://github.com/google/EXEgesis/tree/master/llvm\_sim